Harden RAG PDF ingestion by kkudumu · Pull Request #8 · aietal/aimengpt

kkudumu · 2026-05-14T10:39:50Z

Part of the open Algora bounty for [ISAAC-497] Implement an enhanced RAG Pipeline for Scientific/Research Workflows.

/claim #45

Bounty reference: https://algora.io/isaac/bounties/clq18zr98000ejs0gt0nv7gwu

Summary

move PDF chunk metadata preparation into a tested server helper
make ingestion tolerate missing PDF title/source/page metadata instead of throwing while processing uploaded research PDFs
skip blank chunks before writing to Chroma so retrieval does not surface empty context
return a clear 400 response when the upload request does not include a PDF
remove full-document console logging from the upload path
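The 400 response for a missing PDF could look roughly like the guard below. This is only a sketch: `UploadedFile`, `GuardResult`, and `checkPdfUpload` are hypothetical names, and the actual route in `pages/api/inject-documents.ts` may parse uploads and detect PDFs differently.

```typescript
// Hypothetical shape of a parsed upload; the real route may use other fields.
interface UploadedFile {
  mimetype?: string;
  originalFilename?: string;
}

interface GuardResult {
  ok: boolean;
  status: number;
  error?: string;
}

// Returns a 400-style result when no PDF is attached, so later parsing code
// never runs against an undefined or non-PDF file.
function checkPdfUpload(file: UploadedFile | undefined): GuardResult {
  if (!file) {
    return { ok: false, status: 400, error: "No file was uploaded." };
  }
  const name = (file.originalFilename ?? "").toLowerCase();
  const isPdf = file.mimetype === "application/pdf" || name.endsWith(".pdf");
  if (!isPdf) {
    return { ok: false, status: 400, error: "Uploaded file is not a PDF." };
  }
  return { ok: true, status: 200 };
}
```

Keeping the guard as a pure function makes it easy to cover in a unit test without starting the API route.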

Why this helps the scientific RAG bounty

Scientific PDFs often have incomplete or inconsistent parser metadata. The current upload path assumes metadata.pdf.info.Title, metadata.source, and metadata.loc.pageNumber are always present, so a single malformed parsed chunk can crash ingestion before the RAG pipeline can retrieve anything. This PR is a focused ingestion reliability slice that complements the existing citation/reranking/context PRs.
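As an illustration of the tolerant metadata handling described above, here is a minimal sketch. The metadata paths (`metadata.pdf.info.Title`, `metadata.source`, `metadata.loc.pageNumber`) come from this PR description; the `ParsedChunk` and `PreparedChunk` types, the `pageContent` field name, and the `prepareChunks` function are assumptions for illustration and may not match the helper in `utils/server/rag-ingest.ts`.

```typescript
// Loosely typed view of a parsed chunk: every metadata field may be missing.
interface ParsedChunk {
  pageContent?: string;
  metadata?: {
    source?: string;
    pdf?: { info?: { Title?: string } };
    loc?: { pageNumber?: number };
  };
}

interface PreparedChunk {
  text: string;
  metadata: { title: string; source: string; page: number };
}

// Builds primitive metadata with safe defaults and drops blank chunks,
// instead of throwing when a nested metadata field is absent.
function prepareChunks(chunks: ParsedChunk[]): PreparedChunk[] {
  const prepared: PreparedChunk[] = [];
  for (const chunk of chunks) {
    const text = (chunk.pageContent ?? "").trim();
    if (text.length === 0) continue; // skip whitespace-only chunks
    prepared.push({
      text,
      metadata: {
        title: chunk.metadata?.pdf?.info?.Title ?? "",
        source: chunk.metadata?.source ?? "",
        page: chunk.metadata?.loc?.pageNumber ?? 0,
      },
    });
  }
  return prepared;
}
```

Optional chaining with `??` defaults means a chunk with no metadata at all still produces a valid record rather than a `TypeError`.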

Demo

Demo video: https://raw.githubusercontent.com/kkudumu/aimengpt/codex/rag-ingest-hardening/ui/docs/rag-ingest-demo.mp4

Verification

From ui/:

npx vitest run __tests__/rag-ingest.test.ts
npx prettier --check pages/api/inject-documents.ts utils/server/rag-ingest.ts __tests__/rag-ingest.test.ts
npx tsc --noEmit --pretty false
npm run lint -- --file pages/api/inject-documents.ts --file utils/server/rag-ingest.ts --file __tests__/rag-ingest.test.ts
git diff --check

Results:

Targeted Vitest suite passed: 4 tests
Prettier check passed
TypeScript check passed
ESLint passed
git diff --check passed

AI-Assisted Disclosure

This contribution was produced with AI assistance and manually reviewed/verified before submission.

kkudumu · 2026-05-14T10:50:40Z

Hi maintainers / @algora-pbc, quick visibility note for the Isaac/AimenGPT RAG bounty: this PR targets a distinct ingestion-hardening gap in the scientific RAG flow rather than duplicating the citation/reranking/context-budgeting PRs already open.

Bounty reference: https://algora.io/isaac/bounties/clq18zr98000ejs0gt0nv7gwu

Current verification from ui/:

npx vitest run __tests__/rag-ingest.test.ts -> 4 passed
npx prettier --check pages/api/inject-documents.ts utils/server/rag-ingest.ts __tests__/rag-ingest.test.ts -> passed
npx tsc --noEmit --pretty false -> passed
npm run lint -- --file pages/api/inject-documents.ts --file utils/server/rag-ingest.ts --file __tests__/rag-ingest.test.ts -> passed
git diff --check -> passed

The main behavior change is making uploaded research PDFs with incomplete parser metadata ingest reliably instead of crashing before retrieval can happen.

kkudumu · 2026-05-14T10:54:53Z

Follow-up pushed in 94474be to handle one more ingestion edge case: if a PDF parses into only blank chunks, the upload route now returns a clear 400 instead of calling Chroma with empty arrays.
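The all-blank edge case can be sketched as below. `ChromaBatch`, `buildBatch`, and `respondForBatch` are hypothetical names, the `chunk-N` IDs are placeholders, and the real route may build its arrays and error message differently; the point is only that the empty case is caught before any write to Chroma.

```typescript
// Hypothetical container for the parallel arrays a vector-store write needs.
interface ChromaBatch {
  ids: string[];
  documents: string[];
  metadatas: Record<string, string | number>[];
}

// Returns null when every chunk is blank, so the caller never issues a write
// with empty arrays.
function buildBatch(texts: string[]): ChromaBatch | null {
  const documents = texts.map((t) => t.trim()).filter((t) => t.length > 0);
  if (documents.length === 0) return null;
  return {
    ids: documents.map((_, i) => `chunk-${i}`),
    documents,
    metadatas: documents.map((_, i) => ({ index: i })),
  };
}

// Maps the batch to an HTTP-style outcome; the write itself is left to the caller.
function respondForBatch(batch: ChromaBatch | null): { status: number; message: string } {
  if (batch === null) {
    return { status: 400, message: "No readable text found in the uploaded PDF." };
  }
  return { status: 200, message: `Prepared ${batch.documents.length} chunks.` };
}
```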

Verification after the update from ui/:

npx vitest run __tests__/rag-ingest.test.ts -> 5 passed
npx prettier --check pages/api/inject-documents.ts utils/server/rag-ingest.ts __tests__/rag-ingest.test.ts -> passed
npx tsc --noEmit --pretty false -> passed
npm run lint -- --file pages/api/inject-documents.ts --file utils/server/rag-ingest.ts --file __tests__/rag-ingest.test.ts -> passed
git diff --check -> passed

kkudumu · 2026-05-14T12:08:09Z

Follow-up pushed in 0bc1bd2 to preserve citation metadata in scientific RAG chunks.

What changed:

Extracts the first DOI from chunk text and stores it as primitive Chroma metadata.
Extracts the first publication year and defaults to 0 when absent.
Adds a stable 16-character sourceHash derived from source path and page number so retrieved chunks can be grouped back to the originating source/page without leaking document content into IDs.
Keeps missing values Chroma-compatible with empty string / zero defaults.
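For reference, the citation metadata described above could be derived along these lines. This is a sketch under assumptions: the regular expressions, the `source#page` hash input, and the use of a truncated SHA-256 digest are illustrative choices, not necessarily what commit 0bc1bd2 does, and `extractDoi`, `extractYear`, and `sourceHash` are hypothetical names.

```typescript
import { createHash } from "node:crypto";

// Assumed patterns: a DOI starts with "10.", a registrant code, and a slash;
// a publication year is any standalone 19xx or 20xx number.
const DOI_PATTERN = /\b10\.\d{4,9}\/[^\s"<>]+/;
const YEAR_PATTERN = /\b(?:19|20)\d{2}\b/;

// Returns the first DOI in the text, or "" so the value stays a primitive string.
function extractDoi(text: string): string {
  const match = DOI_PATTERN.exec(text);
  if (!match) return "";
  // Drop sentence punctuation that trails the DOI in running text.
  return match[0].replace(/[.,;:)\]]+$/, "");
}

// Returns the first four-digit year in the text, or 0 when none is found.
function extractYear(text: string): number {
  const match = YEAR_PATTERN.exec(text);
  return match ? Number(match[0]) : 0;
}

// Derives a 16-character hex identifier from source path and page number only,
// so no document text ends up in the ID.
function sourceHash(source: string, page: number): string {
  return createHash("sha256").update(`${source}#${page}`).digest("hex").slice(0, 16);
}
```

A simple year pattern like this can also match four-digit numbers that are not publication years (for example inside a DOI suffix or a page range), so a production helper would likely need stricter context checks.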

Verification:

npx vitest run __tests__/rag-ingest.test.ts
npx prettier --check pages/api/inject-documents.ts utils/server/rag-ingest.ts __tests__/rag-ingest.test.ts
npx tsc --noEmit --pretty false
npm run lint -- --file pages/api/inject-documents.ts --file utils/server/rag-ingest.ts --file __tests__/rag-ingest.test.ts
git diff --check

kkudumu · 2026-05-14T14:15:56Z

Added a short demo video artifact for Algora/reviewer convenience:

Demo video: https://raw.githubusercontent.com/kkudumu/aimengpt/codex/rag-ingest-hardening/ui/docs/rag-ingest-demo.mp4
The video summarizes the PDF/RAG ingestion hardening path and the verification commands from the PR body.

This is supplemental evidence for review; the code and tests remain the source of truth.

kkudumu · 2026-05-14T15:07:18Z

Updated the existing demo video with narrated voiceover explaining the ingestion-hardening changes, safer PDF handling, blank-chunk filtering, citation metadata preservation, and test coverage. The PR's existing demo-video link now points to the narrated MP4.

Commits

2573c77 Harden RAG document ingestion
94474be Handle empty RAG ingestion chunks
0bc1bd2 Preserve citation metadata for RAG chunks
c7a9f13 Add RAG ingestion demo video
7fb1f8d Add narrated RAG ingestion demo video

Harden RAG PDF ingestion #8
kkudumu wants to merge 5 commits into aietal:master from kkudumu:codex/rag-ingest-hardening
